{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Kernel Density Estimation\n",
    "\n",
    "Kernel density estimation is the process of estimating an unknown probability density function using a *kernel function* $K(u)$. While a histogram counts the number of data points in somewhat arbitrary regions, a kernel density estimate is a function defined as the sum of a kernel function on every data point. The kernel function typically exhibits the following properties:\n",
    "\n",
    "1. Symmetry such that $K(u) = K(-u)$.\n",
    "2. Normalization such that $\\int_{-\\infty}^{\\infty} K(u) \\ du = 1$ .\n",
    "3. Monotonically decreasing such that $K'(u) < 0$ when $u > 0$.\n",
    "4. Expected value equal to zero such that $\\mathrm{E}[K] = 0$.\n",
    "\n",
    "For more information about kernel density estimation, see for instance [Wikipedia - Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation).\n",
    "\n",
    "A univariate kernel density estimator is implemented in `sm.nonparametric.KDEUnivariate`.\n",
    "In this example we will show the following:\n",
    "\n",
    "* Basic usage, how to fit the estimator.\n",
    "* The effect of varying the bandwidth of the kernel using the `bw` argument.\n",
    "* The various kernel functions available using the `kernel` argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import numpy as np\n",
    "from scipy import stats\n",
    "import statsmodels.api as sm\n",
    "import matplotlib.pyplot as plt\n",
    "from statsmodels.distributions.mixture_rvs import mixture_rvs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A univariate example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(12345) # Seed the random number generator for reproducible results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We create a bimodal distribution: a mixture of two normal distributions with locations at `-1` and `1`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Location, scale and weight for the two distributions\n",
    "dist1_loc, dist1_scale, weight1 = -1 , .5, .25\n",
    "dist2_loc, dist2_scale, weight2 = 1 , .5, .75\n",
    "\n",
    "# Sample from a mixture of distributions\n",
    "obs_dist = mixture_rvs(prob=[weight1, weight2], size=250,\n",
    "                        dist=[stats.norm, stats.norm],\n",
    "                        kwargs = (dict(loc=dist1_loc, scale=dist1_scale),\n",
    "                                  dict(loc=dist2_loc, scale=dist2_scale)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The simplest non-parametric technique for density estimation is the histogram."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(12, 5))\n",
    "ax = fig.add_subplot(111)\n",
    "\n",
    "# Scatter plot of data samples and histogram\n",
    "ax.scatter(obs_dist, np.abs(np.random.randn(obs_dist.size)),\n",
    "            zorder=15, color='red', marker='x', alpha=0.5, label='Samples')\n",
    "lines = ax.hist(obs_dist, bins=20, edgecolor='k', label='Histogram')\n",
    "\n",
    "ax.legend(loc='best')\n",
    "ax.grid(True, zorder=-5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fitting with the default arguments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The histogram above is discontinuous. To compute a continuous probability density function,\n",
    "we can use kernel density estimation.\n",
    "\n",
    "We initialize a univariate kernel density estimator using `KDEUnivariate`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "kde = sm.nonparametric.KDEUnivariate(obs_dist)\n",
    "kde.fit() # Estimate the densities"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We present a figure of the fit, as well as the true distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(12, 5))\n",
    "ax = fig.add_subplot(111)\n",
    "\n",
    "# Plot the histrogram\n",
    "ax.hist(obs_dist, bins=20, density=True, label='Histogram from samples',\n",
    "        zorder=5, edgecolor='k', alpha=0.5)\n",
    "\n",
    "# Plot the KDE as fitted using the default arguments\n",
    "ax.plot(kde.support, kde.density, lw=3, label='KDE from samples', zorder=10)\n",
    "\n",
    "# Plot the true distribution\n",
    "true_values = (stats.norm.pdf(loc=dist1_loc, scale=dist1_scale, x=kde.support)*weight1\n",
    "              + stats.norm.pdf(loc=dist2_loc, scale=dist2_scale, x=kde.support)*weight2)\n",
    "ax.plot(kde.support, true_values, lw=3, label='True distribution', zorder=15)\n",
    "\n",
    "# Plot the samples\n",
    "ax.scatter(obs_dist, np.abs(np.random.randn(obs_dist.size))/40,\n",
    "           marker='x', color='red', zorder=20, label='Samples', alpha=0.5)\n",
    "\n",
    "ax.legend(loc='best')\n",
    "ax.grid(True, zorder=-5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the code above, default arguments were used. We can also vary the bandwidth of the kernel, as we will now see."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Varying the bandwidth using the `bw` argument"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The bandwidth of the kernel can be adjusted using the `bw` argument.\n",
    "In the following example, a bandwidth of `bw=0.2` seems to fit the data well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(12, 5))\n",
    "ax = fig.add_subplot(111)\n",
    "\n",
    "# Plot the histrogram\n",
    "ax.hist(obs_dist, bins=25, label='Histogram from samples',\n",
    "        zorder=5, edgecolor='k', density=True, alpha=0.5)\n",
    "\n",
    "# Plot the KDE for various bandwidths\n",
    "for bandwidth in [0.1, 0.2, 0.4]:\n",
    "    kde.fit(bw=bandwidth) # Estimate the densities\n",
    "    ax.plot(kde.support, kde.density, '--', lw=2, color='k', zorder=10,\n",
    "            label='KDE from samples, bw = {}'.format(round(bandwidth, 2)))\n",
    "\n",
    "# Plot the true distribution\n",
    "ax.plot(kde.support, true_values, lw=3, label='True distribution', zorder=15)\n",
    "\n",
    "# Plot the samples\n",
    "ax.scatter(obs_dist, np.abs(np.random.randn(obs_dist.size))/50,\n",
    "           marker='x', color='red', zorder=20, label='Data samples', alpha=0.5)\n",
    "\n",
    "ax.legend(loc='best')\n",
    "ax.set_xlim([-3, 3])\n",
    "ax.grid(True, zorder=-5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparing kernel functions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the example above, a Gaussian kernel was used. Several other kernels are also available."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from statsmodels.nonparametric.kde import kernel_switch\n",
    "list(kernel_switch.keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The available kernel functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a figure\n",
    "fig = plt.figure(figsize=(12, 5))\n",
    "\n",
    "# Enumerate every option for the kernel\n",
    "for i, (ker_name, ker_class) in enumerate(kernel_switch.items()):\n",
    "\n",
    "    # Initialize the kernel object\n",
    "    kernel = ker_class()\n",
    "\n",
    "    # Sample from the domain\n",
    "    domain = kernel.domain or [-3, 3]\n",
    "    x_vals = np.linspace(*domain, num=2**10)\n",
    "    y_vals = kernel(x_vals)\n",
    "\n",
    "    # Create a subplot, set the title\n",
    "    ax = fig.add_subplot(2, 4, i + 1)\n",
    "    ax.set_title('Kernel function \"{}\"'.format(ker_name))\n",
    "    ax.plot(x_vals, y_vals, lw=3, label='{}'.format(ker_name))\n",
    "    ax.scatter([0], [0], marker='x', color='red')\n",
    "    plt.grid(True, zorder=-5)\n",
    "    ax.set_xlim(domain)\n",
    "\n",
    "plt.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The available kernel functions on three data points"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now examine how the kernel density estimate will fit to three equally spaced data points."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create three equidistant points\n",
    "data = np.linspace(-1, 1, 3)\n",
    "kde = sm.nonparametric.KDEUnivariate(data)\n",
    "\n",
    "# Create a figure\n",
    "fig = plt.figure(figsize=(12, 5))\n",
    "\n",
    "# Enumerate every option for the kernel\n",
    "for i, kernel in enumerate(kernel_switch.keys()):\n",
    "\n",
    "    # Create a subplot, set the title\n",
    "    ax = fig.add_subplot(2, 4, i + 1)\n",
    "    ax.set_title('Kernel function \"{}\"'.format(kernel))\n",
    "\n",
    "    # Fit the model (estimate densities)\n",
    "    kde.fit(kernel=kernel, fft=False, gridsize=2**10)\n",
    "\n",
    "    # Create the plot\n",
    "    ax.plot(kde.support, kde.density, lw=3, label='KDE from samples', zorder=10)\n",
    "    ax.scatter(data, np.zeros_like(data), marker='x', color='red')\n",
    "    plt.grid(True, zorder=-5)\n",
    "    ax.set_xlim([-3, 3])\n",
    "\n",
    "plt.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A more difficult case\n",
    "\n",
    "The fit is not always perfect. See the example below for a harder case."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "obs_dist = mixture_rvs([.25, .75], size=250, dist=[stats.norm, stats.beta],\n",
    "            kwargs = (dict(loc=-1, scale=.5), dict(loc=1, scale=1, args=(1, .5))))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "kde = sm.nonparametric.KDEUnivariate(obs_dist)\n",
    "kde.fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(12, 5))\n",
    "ax = fig.add_subplot(111)\n",
    "ax.hist(obs_dist, bins=20, density=True, edgecolor='k', zorder=4, alpha=0.5)\n",
    "ax.plot(kde.support, kde.density, lw=3, zorder=7)\n",
    "# Plot the samples\n",
    "ax.scatter(obs_dist, np.abs(np.random.randn(obs_dist.size))/50,\n",
    "           marker='x', color='red', zorder=20, label='Data samples', alpha=0.5)\n",
    "ax.grid(True, zorder=-5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The KDE is a distribution\n",
    "\n",
    "Since the KDE is a distribution, we can access attributes and methods such as:\n",
    "\n",
    "- `entropy`\n",
    "- `evaluate`\n",
    "- `cdf`\n",
    "- `icdf`\n",
    "- `sf`\n",
    "- `cumhazard`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "obs_dist = mixture_rvs([.25, .75], size=1000, dist=[stats.norm, stats.norm],\n",
    "                kwargs = (dict(loc=-1, scale=.5), dict(loc=1, scale=.5)))\n",
    "kde = sm.nonparametric.KDEUnivariate(obs_dist)\n",
    "kde.fit(gridsize=2**10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "kde.entropy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "kde.evaluate(-1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cumulative distribution, it's inverse, and the survival function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(12, 5))\n",
    "ax = fig.add_subplot(111)\n",
    "\n",
    "ax.plot(kde.support, kde.cdf, lw=3, label='CDF')\n",
    "ax.plot(np.linspace(0, 1, num = kde.icdf.size), kde.icdf, lw=3, label='Inverse CDF')\n",
    "ax.plot(kde.support, kde.sf, lw=3, label='Survival function')\n",
    "ax.legend(loc = 'best')\n",
    "ax.grid(True, zorder=-5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The Cumulative Hazard Function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(12, 5))\n",
    "ax = fig.add_subplot(111)\n",
    "ax.plot(kde.support, kde.cumhazard, lw=3, label='Cumulative Hazard Function')\n",
    "ax.legend(loc = 'best')\n",
    "ax.grid(True, zorder=-5)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
